A Model and a Language for Representing and Manipulating Annotated Text Collections
Abstract
Traditionally, collections of texts are represented digitally as a set of documents that contain the text together with some kind of markup defining extra information such as metadata and annotations. We propose a different approach that models the textual information in a dual way: as a formatted sequence of characters, and as a composition of a particular kind of object, called textual objects. With textual objects it is possible to represent different structures over the same text, together with complex annotations. Manuzio is a statically type-checked language for defining a schema of such textual objects and for writing complex queries and applications on them with a set of powerful operators. In this paper we introduce the foundation of our textual model and the main features of the language, and we sketch a system to manage persistent collections of texts and execute Manuzio programs.
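The dual view described in the abstract can be illustrated with a minimal Python sketch. This is not Manuzio syntax and not the authors' implementation: the `TextualObject` class, the `span` and `find_all` helpers, the kind names (`poem`, `line`, `sentence`), and the sample text are all assumptions made up for illustration. It shows one character sequence carrying two structures that overlap without nesting, plus an annotation attached to an object rather than embedded in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextualObject:
    """Hypothetical textual object: a typed span over a shared character sequence."""
    kind: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    components: List["TextualObject"] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)

    def text(self, source: str) -> str:
        # The underlying characters are never copied into the object.
        return source[self.start:self.end]


def span(kind: str, source: str, fragment: str, **annotations: str) -> TextualObject:
    """Build an object covering the first occurrence of `fragment` in `source`."""
    start = source.index(fragment)
    return TextualObject(kind, start, start + len(fragment),
                         annotations=dict(annotations))


def find_all(root: TextualObject, kind: str) -> List[TextualObject]:
    """Collect every object of the given kind reachable from `root`."""
    found = [root] if root.kind == kind else []
    for component in root.components:
        found.extend(find_all(component, kind))
    return found


# Made-up sample text (an assumption, not taken from the paper).
SOURCE = "Sing, goddess, the anger\nof Achilles. Begin, Muse."

# Structure 1: the text as lines.
lines = [span("line", SOURCE, fragment) for fragment in SOURCE.split("\n")]
metrical = TextualObject("poem", 0, len(SOURCE), components=lines)

# Structure 2: the same text as sentences; the first sentence crosses a line break.
sentences = [
    span("sentence", SOURCE, "Sing, goddess, the anger\nof Achilles."),
    span("sentence", SOURCE, "Begin, Muse.", note="invocation"),
]
logical = TextualObject("poem", 0, len(SOURCE), components=sentences)
```

Because both structures only reference offsets into `SOURCE`, the first sentence can straddle the boundary between the two lines, which a single hierarchical markup tree over the same text could not express directly.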
Related resources
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in natural language processing, and is usually treated as a subtask of information extraction. It aims at finding proper nouns in a text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. The main task of a coreference resolution system is therefore to identify terms that refer to the same entity. A coreference resolution tool could be...
Investigating Ideological Manipulation in Subtitling Based on Farahzad’s CDA Model: A Case Study of The Salesman
Translation plays an important role in conveying and manipulating ideologies. Accordingly, this study sought to analyze the ideological elements in the English subtitles of the Persian movie The Salesman. The framework to find the driven ideological strategies in the translation of the Persian audio of the same movie was based on the critical discourse analysis (...
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Persian morphology requires a half-space character to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Manuzio: An Object Language for Annotated Text Collections
Traditionally, text collections are represented as text files with some kind of markup to define extra-textual information, like metadata, annotations, etc. We propose an approach which uses the natural structure of a literary text to build specialized object abstractions on text collections, objects which can be used to make non-hierarchically nested, multi-level annotations, to create comple...
Publication date: 2009